{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ba450fb1-8a26-4894-ab7a-5d7bfefe90ce",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "<br>汉化的库: <a href=\"https://github.com/GoatCsu/CN-LLMs-from-scratch.git\">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
   "metadata": {},
   "source": [
    "# 第五章 课后练习"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "37aa4692-2357-4d88-b072-6d2d988d7f4f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "numpy version: 1.26.4\n",
      "tiktoken version: 0.7.0\n",
      "torch version: 2.4.0\n",
      "tensorflow version: 2.16.1\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\"numpy\", \n",
    "        \"tiktoken\", \n",
    "        \"torch\",\n",
    "        \"tensorflow\" # For OpenAI's pretrained weights\n",
    "       ]\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e",
   "metadata": {},
   "source": [
    "# 练习 5.1：基于温度缩放(Temperature Scaling)的 Softmax 得分与采样概率"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5860ba9f-2db3-4480-b96b-4be1c68981eb",
   "metadata": {},
   "source": [
    "- 我们可以使用本节中定义的 `print_sampled_tokens` 函数来打印“pizza”这个词被采样的次数。\n",
    "- 我们从第 5.3.1 节中定义的代码开始。\n",
    "\n",
    "- 当温度为 0 或 0.1 时，它的采样次数为 0x，而当温度被调高到 5 时，它的采样次数为 32x。估计的概率是 32/1000 * 100% = 3.2%。\n",
    "\n",
    "- 实际的概率是 4.3%，并包含在重新缩放的 softmax 概率张量（`scaled_probas[2][6]`）中。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cba59c2-a8a3-4af3-add4-70230795225e",
   "metadata": {},
   "source": [
    "- 以下是一个完整的示例，使用了第 5 章中的代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "42dda298-3014-4c36-8d63-97c210bcf4e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "vocab = { \n",
    "    \"closer\": 0,\n",
    "    \"every\": 1, \n",
    "    \"effort\": 2, \n",
    "    \"forward\": 3,\n",
    "    \"inches\": 4,\n",
    "    \"moves\": 5, \n",
    "    \"pizza\": 6,\n",
    "    \"toward\": 7,\n",
    "    \"you\": 8,\n",
    "} \n",
    "inverse_vocab = {v: k for k, v in vocab.items()}\n",
    "\n",
    "next_token_logits = torch.tensor(\n",
    "    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]\n",
    ")\n",
    "\n",
    "def print_sampled_tokens(probas):\n",
    "    torch.manual_seed(123)\n",
    "    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]\n",
    "    sampled_ids = torch.bincount(torch.tensor(sample))\n",
    "    for i, freq in enumerate(sampled_ids):\n",
    "        print(f\"{freq} x {inverse_vocab[i]}\")\n",
    "\n",
    "\n",
    "def softmax_with_temperature(logits, temperature):\n",
    "    scaled_logits = logits / temperature\n",
    "    return torch.softmax(scaled_logits, dim=0)\n",
    "\n",
    "\n",
    "temperatures = [1, 0.1, 5]  # Original, higher, and lower temperature\n",
    "scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ee0f9f3-4132-42c7-8324-252fd8f59145",
   "metadata": {},
   "source": [
    "- 现在，我们可以遍历 `scaled_probas` 并打印每种情况下的采样频率："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b5605236-e300-4844-aea7-509d868efbdd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Temperature: 1\n",
      "73 x closer\n",
      "0 x every\n",
      "0 x effort\n",
      "582 x forward\n",
      "2 x inches\n",
      "0 x moves\n",
      "0 x pizza\n",
      "343 x toward\n",
      "\n",
      "\n",
      "Temperature: 0.1\n",
      "0 x closer\n",
      "0 x every\n",
      "0 x effort\n",
      "985 x forward\n",
      "0 x inches\n",
      "0 x moves\n",
      "0 x pizza\n",
      "15 x toward\n",
      "\n",
      "\n",
      "Temperature: 5\n",
      "165 x closer\n",
      "75 x every\n",
      "42 x effort\n",
      "239 x forward\n",
      "71 x inches\n",
      "46 x moves\n",
      "32 x pizza\n",
      "227 x toward\n",
      "103 x you\n"
     ]
    }
   ],
   "source": [
    "for i, probas in enumerate(scaled_probas):\n",
    "    print(\"\\n\\nTemperature:\", temperatures[i])\n",
    "    print_sampled_tokens(probas)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbf88c97-19c4-462c-924a-411c8c765d2c",
   "metadata": {},
   "source": [
    "- 请注意，采样提供了一个对实际概率的近似，当“pizza”这个词被采样时。\n",
    "- 例如，如果它被采样了 32/1000 次，估计的概率是 3.2%。\n",
    "- 要获得实际的概率，我们可以通过访问 `scaled_probas` 中对应的条目来直接查看概率。\n",
    "\n",
    "- 由于“pizza”是词汇表中的第 7 个条目，在温度为 5 时，我们可以按如下方式获取它："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1d4163c0-22ad-4f5b-8e20-b7420e9dbfc6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor(0.0430)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp5_idx = 2\n",
    "pizza_idx = 6\n",
    "\n",
    "scaled_probas[temp5_idx][pizza_idx]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3dcb438-5f18-4332-9627-66009f30a1a4",
   "metadata": {},
   "source": [
    "如果温度设置为 5，\"pizza\" 这个词被采样的概率为 4.3%。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b510ffb0-adca-4d64-8a12-38c4646fd736",
   "metadata": {},
   "source": [
    "# 练习 5.2：不同温度和 top-k 设置"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10263328",
   "metadata": {},
   "source": [
    "- 译者的理解:\n",
    "  - 温度”（temperature）通常控制采样的多样性\n",
    "  - “top-k”则是指从概率最高的前 k 个词中采样,保证确定性"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "884990db-d1a6-4c4e-8e36-2c1e4c1e67c7",
   "metadata": {},
   "source": [
    "- 温度和 top-k 参数设置需要根据具体的 LLM 调整（这是一种试错过程，直到生成理想的输出）。\n",
    "- 然而，理想的结果也是应用特定的：\n",
    "  - 较低的 top-k 和温度会导致较少随机的输出，这在创建教育内容、技术写作、问答、数据分析、代码生成等任务中是更为理想的。\n",
    "  - 较高的 top-k 和温度会导致更具多样性和随机性的输出，这在头脑风暴、创意写作等任务中更为理想。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f35425d-529d-4179-a1c4-63cb8b25b156",
   "metadata": {},
   "source": [
    "# 练习 5.3：解码函数中的确定性行为"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d12229a2-1d52-46ff-b1e8-198f2e58a7d2",
   "metadata": {},
   "source": [
    "有多种方式可以强制 `generate` 函数实现确定性行为：\n",
    "\n",
    "1. 将 `top_k=None`，并且不应用温度缩放；\n",
    "2. 设置 `top_k=1`。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "391c5dc8-8dd7-4a0a-90bd-519b72f528c7",
   "metadata": {},
   "source": [
    "- 以下是一个完整的示例，使用了第 5 章中的代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a61a4034-797a-4635-bf42-ddfff1b07125",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "import torch\n",
    "from previous_chapters import GPTModel\n",
    "\n",
    "\n",
    "GPT_CONFIG_124M = {\n",
    "    \"vocab_size\": 50257,  # 词汇表大小\n",
    "    \"context_length\": 256,       # 缩短的上下文长度（原始值：1024）\n",
    "    \"emb_dim\": 768,       # 嵌入维度\n",
    "    \"n_heads\": 12,        # 注意力头数\n",
    "    \"n_layers\": 12,       # 层数\n",
    "    \"drop_rate\": 0.1,     # 丢弃率\n",
    "    \"qkv_bias\": False     # 查询-键-值偏置\n",
    "}\n",
    "\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
    "model = GPTModel(GPT_CONFIG_124M)\n",
    "model.load_state_dict(torch.load(\"model.pth\", weights_only=True))\n",
    "model.eval();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ee95a272-b852-43b4-9827-ea7e1dbd5724",
   "metadata": {},
   "outputs": [],
   "source": [
    "from gpt_generate import generate, text_to_token_ids, token_ids_to_text\n",
    "from previous_chapters import generate_text_simple"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4ab43658-3240-484a-9072-a40a0ed85be6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output text:\n",
      " Every effort moves you know,\" was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun\n"
     ]
    }
   ],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ebb22d06-393a-42d3-ab64-66646d33b39b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output text:\n",
      " Every effort moves you know,\" was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun\n"
     ]
    }
   ],
   "source": [
    "# 使用torch.argmax的确定性函数\n",
    "\n",
    "start_context = \"Every effort moves you\"\n",
    "\n",
    "token_ids = generate_text_simple(\n",
    "    model=model,\n",
    "    idx=text_to_token_ids(start_context, tokenizer),\n",
    "    max_new_tokens=25,\n",
    "    context_size=GPT_CONFIG_124M[\"context_length\"]\n",
    ")\n",
    "\n",
    "print(\"生成的文本:\\n\", token_ids_to_text(token_ids, tokenizer))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c85b1f11-37a5-477d-9c2d-170a6865e669",
   "metadata": {},
   "source": [
    "- Note that re-executing the previous code cell will produce the exact same generated text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "75469f24-47cc-458d-a200-fe64c648131d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output text:\n",
      " Every effort moves you know,\" was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun\n"
     ]
    }
   ],
   "source": [
    "# 确定性行为：没有top_k，没有温度缩放\n",
    "\n",
    "token_ids = generate(\n",
    "    model=model,\n",
    "    idx=text_to_token_ids(\"Every effort moves you\", tokenizer),\n",
    "    max_new_tokens=25,\n",
    "    context_size=GPT_CONFIG_124M[\"context_length\"],\n",
    "    top_k=None,\n",
    "    temperature=0.0\n",
    ")\n",
    "\n",
    "print(\"生成的文本:\\n\", token_ids_to_text(token_ids, tokenizer))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d0480e5-fb4e-41f8-a161-7ac980d71d47",
   "metadata": {},
   "source": [
    "# 练习 5.4: 后续预训练"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f40044e8-a0f5-476c-99fd-489b999fd80a",
   "metadata": {},
   "source": [
    "# 如果我们仍然在第5章首次训练模型的Python会话中，要继续预训练一个轮次，我们只需要加载在主章节中保存的模型和优化器，并再次调用`train_model_simple`函数\n",
    "\n",
    "# 在新的代码环境中确保结果可复现需要多几个步骤\n",
    "# 首先，我们加载分词器、模型和优化器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "94eae6ba-d9fd-417a-8e31-fc39e9299870",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "import torch\n",
    "from previous_chapters import GPTModel\n",
    "\n",
    "\n",
    "GPT_CONFIG_124M = {\n",
    "    \"vocab_size\": 50257,   # 词汇表大小\n",
    "    \"context_length\": 256, # 缩短的上下文长度（原始值：1024）\n",
    "    \"emb_dim\": 768,        # 嵌入维度\n",
    "    \"n_heads\": 12,         # 注意力头数\n",
    "    \"n_layers\": 12,        # 层数\n",
    "    \"drop_rate\": 0.1,      # 丢弃率\n",
    "    \"qkv_bias\": False      # 查询-键-值偏置\n",
    "}\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
    "\n",
    "checkpoint = torch.load(\"model_and_optimizer.pth\", weights_only=True)\n",
    "model = GPTModel(GPT_CONFIG_124M)\n",
    "model.load_state_dict(checkpoint[\"model_state_dict\"])\n",
    "model.to(device)\n",
    "\n",
    "optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)\n",
    "optimizer.load_state_dict(checkpoint[\"optimizer_state_dict\"])\n",
    "model.train();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "688fce4a-9ab2-4d97-a95c-fef02c32b4f3",
   "metadata": {},
   "source": [
    "- 接下来初始化载入器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b5a78470-0652-4abd-875a-664e23c07c36",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import urllib.request\n",
    "from previous_chapters import create_dataloader_v1\n",
    "\n",
    "\n",
    "file_path = \"the-verdict.txt\"\n",
    "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
    "\n",
    "if not os.path.exists(file_path):\n",
    "    with urllib.request.urlopen(url) as response:\n",
    "        text_data = response.read().decode('utf-8')\n",
    "    with open(file_path, \"w\", encoding=\"utf-8\") as file:\n",
    "        file.write(text_data)\n",
    "else:\n",
    "    with open(file_path, \"r\", encoding=\"utf-8\") as file:\n",
    "        text_data = file.read()\n",
    "\n",
    "\n",
    "# 训练/验证集比例\n",
    "train_ratio = 0.90\n",
    "split_idx = int(train_ratio * len(text_data))\n",
    "train_data = text_data[:split_idx]\n",
    "val_data = text_data[split_idx:]\n",
    "\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "train_loader = create_dataloader_v1(\n",
    "    train_data,\n",
    "    batch_size=2,\n",
    "    max_length=GPT_CONFIG_124M[\"context_length\"],\n",
    "    stride=GPT_CONFIG_124M[\"context_length\"],\n",
    "    drop_last=True,\n",
    "    shuffle=True,\n",
    "    num_workers=0\n",
    ")\n",
    "\n",
    "val_loader = create_dataloader_v1(\n",
    "    val_data,\n",
    "    batch_size=2,\n",
    "    max_length=GPT_CONFIG_124M[\"context_length\"],\n",
    "    stride=GPT_CONFIG_124M[\"context_length\"],\n",
    "    drop_last=False,\n",
    "    shuffle=False,\n",
    "    num_workers=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76598ef8-165c-4bcc-af5e-b6fe72398365",
   "metadata": {},
   "source": [
    " 最后，我们使用`train_model_simple`函数来训练模型："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ab4693dc-1359-47a7-8110-1e90f514a49e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ep 1 (Step 000000): Train loss 0.271, Val loss 6.545\n",
      "Ep 1 (Step 000005): Train loss 0.244, Val loss 6.614\n",
      "Every effort moves you?\"  \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\"  He laughed again, and threw back his head to look up at the sketch of the donkey. \"There were days when I\n"
     ]
    }
   ],
   "source": [
    "from gpt_train import train_model_simple\n",
    "\n",
    "num_epochs = 1\n",
    "train_losses, val_losses, tokens_seen = train_model_simple(\n",
    "    model, train_loader, val_loader, optimizer, device,\n",
    "    num_epochs=num_epochs, eval_freq=5, eval_iter=5,\n",
    "    start_context=\"Every effort moves you\", tokenizer=tokenizer\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3384e788-f5a1-407c-8dd1-87959b75026d",
   "metadata": {},
   "source": [
    "# 练习 5.5: 预训练模型的训练和验证集Loss"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cb1140b-2027-4156-8d19-600ac849edbe",
   "metadata": {},
   "source": [
    "我们可以使用以下代码计算GPT模型的训练集和验证集损失：\n",
    "\n",
    "```python\n",
    "train_loss = calc_loss_loader(train_loader, gpt, device)\n",
    "val_loss = calc_loss_loader(val_loader, gpt, device)\n",
    "```\n",
    "\n",
    "- 124M模型的Loss如下\n",
    "\n",
    "```\n",
    "Training loss: 3.754748503367106\n",
    "Validation loss: 3.559617757797241\n",
    "```\n",
    "- 主要观察结果:训练集和验证集的表现差不多\n",
    "- 这可能有多种解释：\n",
    "\n",
    "1. The Verdict并不是OpenAI在训练GPT-2时使用的预训练数据集的一部分。因此，模型并没有显式地对训练集发生过拟合，并且在The Verdict的训练集和验证集部分表现相似。（在深度学习中，验证集的损失略低于训练集的损失，这并不常见。由于数据集相对较小，这种现象可能是由随机噪声引起的。在实际应用中，如果没有发生过拟合，训练集和验证集的表现通常是大致相同的）。\n",
    "\n",
    "2. The Verdict是GPT-2训练数据集的一部分。在这种情况下，我们无法判断模型是否在训练数据上发生了过拟合，因为验证集也会用于训练。为了评估过拟合的程度，我们需要一个在OpenAI完成训练GPT-2后生成的新数据集，以确保该数据集不可能是预训练数据的一部分。\n",
    "主要观察结果是训练集和验证集的表现差不多\n",
    "这可能有多种解释：\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66bb4316-a57c-437f-9a01-fe99b1678524",
   "metadata": {},
   "source": [
    "下面的代码是一个可复现的独立示例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "68d162d6-bbb9-4d6d-82ee-1c410694f872",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "import torch\n",
    "from previous_chapters import GPTModel\n",
    "\n",
    "\n",
    "GPT_CONFIG_124M = {\n",
    "    \"vocab_size\": 50257,   # 词汇表大小\n",
    "    \"context_length\": 256, # 缩短的上下文长度（原始值：1024）\n",
    "    \"emb_dim\": 768,        # 嵌入维度\n",
    "    \"n_heads\": 12,         # 注意力头数\n",
    "    \"n_layers\": 12,        # 层数\n",
    "    \"drop_rate\": 0.1,      # 丢弃率\n",
    "    \"qkv_bias\": False      # 查询-键-值偏置\n",
    "}\n",
    "\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "tokenizer = tiktoken.get_encoding(\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "d8373461-7dad-47da-a489-3e23f0799b23",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File already exists and is up-to-date: gpt2/124M/checkpoint\n",
      "File already exists and is up-to-date: gpt2/124M/encoder.json\n",
      "File already exists and is up-to-date: gpt2/124M/hparams.json\n",
      "File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001\n",
      "File already exists and is up-to-date: gpt2/124M/model.ckpt.index\n",
      "File already exists and is up-to-date: gpt2/124M/model.ckpt.meta\n",
      "File already exists and is up-to-date: gpt2/124M/vocab.bpe\n"
     ]
    }
   ],
   "source": [
    "from gpt_download import download_and_load_gpt2\n",
    "\n",
    "settings, params = download_and_load_gpt2(model_size=\"124M\", models_dir=\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "cdd44873-d6c2-4471-a20f-f639b09fdcd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 在字典中定义模型配置以简化表示\n",
    "model_configs = {\n",
    "    \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
    "    \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
    "    \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
    "    \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
    "}\n",
    "\n",
    "# 复制基础配置并根据特定模型设置进行更新\n",
    "model_name = \"gpt2-small (124M)\"  # 示例模型名称\n",
    "NEW_CONFIG = GPT_CONFIG_124M.copy()\n",
    "NEW_CONFIG.update(model_configs[model_name])\n",
    "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n",
    "\n",
    "gpt = GPTModel(NEW_CONFIG)\n",
    "gpt.eval();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "c7d562e4-33f6-4611-9b75-6ad1cb441d3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from gpt_generate import load_weights_into_gpt\n",
    "\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "load_weights_into_gpt(gpt, params)\n",
    "gpt.to(device);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "46eda9ea-ccb0-46ee-931b-3c07502b2544",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import urllib.request\n",
    "from previous_chapters import create_dataloader_v1\n",
    "\n",
    "\n",
    "file_path = \"the-verdict.txt\"\n",
    "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
    "\n",
    "if not os.path.exists(file_path):\n",
    "    with urllib.request.urlopen(url) as response:\n",
    "        text_data = response.read().decode('utf-8')\n",
    "    with open(file_path, \"w\", encoding=\"utf-8\") as file:\n",
    "        file.write(text_data)\n",
    "else:\n",
    "    with open(file_path, \"r\", encoding=\"utf-8\") as file:\n",
    "        text_data = file.read()\n",
    "\n",
    "\n",
    "# 比例\n",
    "train_ratio = 0.90\n",
    "split_idx = int(train_ratio * len(text_data))\n",
    "train_data = text_data[:split_idx]\n",
    "val_data = text_data[split_idx:]\n",
    "\n",
    "\n",
    "torch.manual_seed(123)\n",
    "\n",
    "train_loader = create_dataloader_v1(\n",
    "    train_data,\n",
    "    batch_size=2,\n",
    "    max_length=GPT_CONFIG_124M[\"context_length\"],\n",
    "    stride=GPT_CONFIG_124M[\"context_length\"],\n",
    "    drop_last=True,\n",
    "    shuffle=True,\n",
    "    num_workers=0\n",
    ")\n",
    "\n",
    "val_loader = create_dataloader_v1(\n",
    "    val_data,\n",
    "    batch_size=2,\n",
    "    max_length=GPT_CONFIG_124M[\"context_length\"],\n",
    "    stride=GPT_CONFIG_124M[\"context_length\"],\n",
    "    drop_last=False,\n",
    "    shuffle=False,\n",
    "    num_workers=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4e3574a2-687d-47a2-a2f6-457fe9d595f1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training loss: 3.7547486888037787\n",
      "Validation loss: 3.5596182346343994\n"
     ]
    }
   ],
   "source": [
    "from gpt_train import calc_loss_loader\n",
    "\n",
    "torch.manual_seed(123) # 确保可复现\n",
    "train_loss = calc_loss_loader(train_loader, gpt, device)\n",
    "val_loss = calc_loss_loader(val_loader, gpt, device)\n",
    "\n",
    "print(\"Training loss:\", train_loss)\n",
    "print(\"Validation loss:\", val_loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96485d6b-bf1f-4bc0-a53f-73b08d85726e",
   "metadata": {},
   "source": [
    "我们也可以对最大的GPT-2模型执行相同操作，但不要忘记更新上下文长度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "1a79a4b6-fe8f-40c2-a018-e731dcf391b3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 43.5kiB/s]\n",
      "encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 2.75MiB/s]\n",
      "hparams.json: 100%|█████████████████████████| 91.0/91.0 [00:00<00:00, 60.2kiB/s]\n",
      "model.ckpt.data-00000-of-00001: 100%|█████| 6.23G/6.23G [06:02<00:00, 17.2MiB/s]\n",
      "model.ckpt.index: 100%|████████████████████| 20.7k/20.7k [00:00<00:00, 171kiB/s]\n",
      "model.ckpt.meta: 100%|████████████████████| 1.84M/1.84M [00:00<00:00, 4.27MiB/s]\n",
      "vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 1.73MiB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training loss: 3.3046312861972384\n",
      "Validation loss: 3.1195147037506104\n"
     ]
    }
   ],
   "source": [
    "settings, params = download_and_load_gpt2(model_size=\"1558M\", models_dir=\"gpt2\")\n",
    "\n",
    "model_name = \"gpt2-xl (1558M)\"\n",
    "NEW_CONFIG = GPT_CONFIG_124M.copy()\n",
    "NEW_CONFIG.update(model_configs[model_name])\n",
    "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n",
    "\n",
    "gpt = GPTModel(NEW_CONFIG)\n",
    "gpt.eval()\n",
    "\n",
    "load_weights_into_gpt(gpt, params)\n",
    "gpt.to(device)\n",
    "\n",
    "torch.manual_seed(123)\n",
    "train_loss = calc_loss_loader(train_loader, gpt, device)\n",
    "val_loss = calc_loss_loader(val_loader, gpt, device)\n",
    "\n",
    "print(\"Training loss:\", train_loss)\n",
    "print(\"Validation loss:\", val_loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a76a1e0-9635-480a-9391-3bda7aea402d",
   "metadata": {},
   "source": [
    "# 练习5.6 用更大的模型"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3d313f4-0038-4bc9-a340-84b3b55dc0e3",
   "metadata": {},
   "source": [
    "- 在主章节中，我们尝试了最小的GPT-2模型，只有124M参数\n",
    "- 这样做的原因是为了将资源需求保持在最低限度\n",
    "- 然而，您可以通过最少的代码更改轻松地尝试更大的模型\n",
    "- 例如，在第5章中，加载1558M模型而不是124M模型，我们只需更改以下两行代码：\n",
    "\n",
    "```python\n",
    "settings, params = download_and_load_gpt2(model_size=\"124M\", models_dir=\"gpt2\")\n",
    "model_name = \"gpt2-small (124M)\"\n",
    "```\n",
    "\n",
    "- 最新的代码如下\n",
    "\n",
    "\n",
    "```python\n",
    "settings, params = download_and_load_gpt2(model_size=\"1558M\", models_dir=\"gpt2\")\n",
    "model_name = \"gpt2-xl (1558M)\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "31e0972b-e85e-4904-a0f5-24c3eacd5fa2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "import torch\n",
    "from previous_chapters import GPTModel\n",
    "\n",
    "\n",
    "GPT_CONFIG_124M = {\n",
    "    \"vocab_size\": 50257,   # 词汇表大小\n",
    "    \"context_length\": 256, # 缩短的上下文长度（原始值：1024）\n",
    "    \"emb_dim\": 768,        # 嵌入维度\n",
    "    \"n_heads\": 12,         # 注意力头数\n",
    "    \"n_layers\": 12,        # 层数\n",
    "    \"drop_rate\": 0.1,      # 丢弃率\n",
    "    \"qkv_bias\": False      # 查询-键-值偏置\n",
    "}\n",
    "\n",
    "\n",
    "tokenizer = tiktoken.get_encoding(\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b641ee88-f9d4-43ec-a787-e34199eed356",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File already exists and is up-to-date: gpt2/1558M/checkpoint\n",
      "File already exists and is up-to-date: gpt2/1558M/encoder.json\n",
      "File already exists and is up-to-date: gpt2/1558M/hparams.json\n",
      "File already exists and is up-to-date: gpt2/1558M/model.ckpt.data-00000-of-00001\n",
      "File already exists and is up-to-date: gpt2/1558M/model.ckpt.index\n",
      "File already exists and is up-to-date: gpt2/1558M/model.ckpt.meta\n",
      "File already exists and is up-to-date: gpt2/1558M/vocab.bpe\n"
     ]
    }
   ],
   "source": [
    "from gpt_download import download_and_load_gpt2\n",
    "from gpt_generate import load_weights_into_gpt\n",
    "\n",
    "\n",
    "model_configs = {\n",
    "    \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
    "    \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
    "    \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
    "    \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
    "}\n",
    "\n",
    "model_name = \"gpt2-xl (1558M)\"\n",
    "NEW_CONFIG = GPT_CONFIG_124M.copy()\n",
    "NEW_CONFIG.update(model_configs[model_name])\n",
    "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n",
    "\n",
    "gpt = GPTModel(NEW_CONFIG)\n",
    "gpt.eval()\n",
    "\n",
    "settings, params = download_and_load_gpt2(model_size=\"1558M\", models_dir=\"gpt2\")\n",
    "load_weights_into_gpt(gpt, params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "c98f56f4-98fc-43b4-9ee5-726e9d17c73f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from gpt_generate import generate, text_to_token_ids, token_ids_to_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "b1f7853c-6e81-4f1f-a1d0-61e2c7d33a20",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output text:\n",
      " Every effort moves you toward finding an ideal life. You don't have to accept your current one at once, because if you do you'll never\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "\n",
    "token_ids = generate(\n",
    "    model=gpt,\n",
    "    idx=text_to_token_ids(\"Every effort moves you\", tokenizer),\n",
    "    max_new_tokens=25,\n",
    "    context_size=NEW_CONFIG[\"context_length\"],\n",
    "    top_k=50,\n",
    "    temperature=1.5\n",
    ")\n",
    "\n",
    "print(\"Output text:\\n\", token_ids_to_text(token_ids, tokenizer))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
